This document describes the data created for APRA’s fundraising data science online learning courses and workshops. All of the data created for these purposes is fictitious.
Three data sets are available as of 2020-08-12.
Each of these tables and the variables contained within each are described below. There are tabs included throughout the document that can be used to explore the variables included in each data set.
These data sets are designed to mirror realistic fundraising data and are not intended to be perfectly “clean”. Common fundraising data challenges are built into the data files. You can click on the Biographical Data tab above to learn more about that data set.
All of the code for this project is available on GitHub. The code that generates the data sets can be found in the generate_data.R script.
The individual data sets can be read into R directly from GitHub as follows.
```r
# load the tidyverse (read_csv, %>%, dplyr verbs) and knitr (kable)
library(tidyverse)
library(knitr)

# read bio data csv into R and store in a data frame named bio
bio <- read_csv("https://raw.githubusercontent.com/majerus/apra_data_science_courses/master/bio_data_table.csv")

# print a random sample of 10 donors as a table
bio %>%
  sample_n(10) %>%
  select(id, name, birthday, city, state, capacity, capacity_source) %>%
  kable()
```
| id | name | birthday | city | state | capacity | capacity_source |
|---|---|---|---|---|---|---|
| 6978639 | Tinner, Stephanie | 1951-05-30 | Jefferson city | MO | $1k - $2.5k | institutional |
| 8941505 | el-Akram, Nawaar | 1956-05-28 | Long beach | CA | $75k - $100k | screening |
| 4242470 | Barela, Elizabeth | 1979-07-24 | Mckinney | TX | $500k - $750k | NA |
| 4258011 | Walsh, Dante | 1970-01-05 | Gibsonville | NC | $10k - $25k | screening |
| 6843645 | Dhindsa, Daniel | 1963-03-16 | Greeley | CO | $25k - $50k | screening |
| 3717480 | Geisert, Jon | 1969-07-29 | Waxahachie | TX | $50k - $75K | screening |
| 7456756 | Jain, Summer | 1999-01-25 | Crestwood | KY | $25k - $50k | screening |
| 8827814 | Bates, Ryan | 1951-09-25 | Greeley | CO | NA | institutional |
| 8891271 | Mills, Carly | 1923-03-06 | Cincinnati | OH | $100k - $250k | screening |
| 6234290 | Bagaporo, Hai | 1962-03-15 | Eastpointe | MI | $50k - $75K | NA |
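The sampled rows above hint at the kind of messiness built into the data: city names are not consistently capitalized (`Jefferson city`) and the capacity labels mix `k` and `K` (`$50k - $75K`). The following is a minimal cleanup sketch using three rows copied from the table. The names `bio_sample` and `bio_clean` are illustrative, and treating these inconsistencies as intended challenges is an assumption.

```r
library(dplyr)
library(stringr)
library(tibble)

# three rows copied from the sample table above
bio_sample <- tibble(
  id       = c(6978639, 3717480, 8827814),
  city     = c("Jefferson city", "Waxahachie", "Greeley"),
  capacity = c("$1k - $2.5k", "$50k - $75K", NA)
)

bio_clean <- bio_sample %>%
  mutate(
    # capitalize the first letter of every word in the city name
    city = str_to_title(city),
    # use a lower-case "k" in every capacity label; NA stays NA
    capacity = str_replace_all(capacity, fixed("K"), "k")
  )

bio_clean
```

Note that `str_to_title()` is a blunt tool: it would leave `Mckinney` unchanged rather than produce `McKinney`, so names like that may need a lookup table.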
The biographical data has 14 variables and 100,000 observations. The data is stored at the donor level. Each row of the data represents a unique donor and biographical information about that donor.
There are 4 numeric variables:
```
## Rows: 100,000
## Columns: 4
## $ id <dbl> 8275707, 2963581, 4302254, 7637444, 9369155, 1026439, 65…
## $ household_id <dbl> 1000235, 1000235, 1000303, 1000341, 1000341, 1000435, 10…
## $ lat <dbl> 34.03, 41.29, NA, 36.07, 26.23, 33.60, 40.99, 38.82, 32.…
## $ lon <dbl> -117.75, -92.63, NA, -94.15, -80.13, -117.71, -74.34, -7…
```
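Because the data is stored at the donor level, several donors can share one `household_id` (for example, `1000235` appears twice in the output above). A small sketch of counting donors per household follows, using only the first six `id` / `household_id` pairs shown. The object names are illustrative, not part of the project code.

```r
library(dplyr)
library(tibble)

# first six id / household_id pairs from the output above
bio_ids <- tibble(
  id           = c(8275707, 2963581, 4302254, 7637444, 9369155, 1026439),
  household_id = c(1000235, 1000235, 1000303, 1000341, 1000341, 1000435)
)

# number of donor records in each household
household_sizes <- bio_ids %>%
  count(household_id, name = "n_donors")

# households that contain more than one donor record
shared_households <- household_sizes %>%
  filter(n_donors > 1)

shared_households
```

The same `count()` call could be run on the full `bio` data frame once it has been loaded.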
When loaded by default there are 9 character variables:
```
## Rows: 100,000
## Columns: 9
## $ name <chr> "al-Shakoor, Labeeb", "Nero, Brianna", "al-Rasheed, R…
## $ country <chr> "United States", "United States", "China", "United St…
## $ city <chr> "Pomona", "Oskaloosa", "Shenzhen", "Fayetteville", "P…
## $ deceased <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ zip <chr> "91766", "52577", NA, "72701", "33069", "92653", "074…
## $ state <chr> "CA", "IA", NA, "AR", "FL", "CA", "NJ", "VA", "TX", N…
## $ capacity <chr> "$50k - $75K", "$1k - $2.5k", "$5k - $10k", "$2.5k - …
## $ capacity_source <chr> "screening", "screening", "screening", "institutional…
## $ race <chr> "Non-Hispanic white", "Non-Hispanic white", "Asian", …
```
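These columns are character only because of how `read_csv()` guesses types by default. If a categorical column is easier to work with as a factor, the type can be set at load time through `col_types`. The sketch below assumes readr 2.0 or later (for `I()` literal input) and uses four rows taken from the output above as inline CSV text in place of the full file. Which columns to convert is a judgment call, not something the project prescribes.

```r
library(readr)

# four rows taken from the output above, written as inline CSV text;
# I() marks the string as literal data rather than a file path
csv_text <- I("id,state,capacity_source
8275707,CA,screening
2963581,IA,screening
4302254,NA,screening
7637444,AR,institutional
")

bio_typed <- read_csv(
  csv_text,
  col_types = cols(
    id              = col_double(),
    state           = col_factor(),  # levels inferred from the data
    capacity_source = col_factor()
  )
)

str(bio_typed)
```

Columns such as `zip` are better left as character, since converting them to numbers would drop leading zeros.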
There is 1 date variable:
```
## Rows: 100,000
## Columns: 1
## $ birthday <date> 1923-11-18, 1925-03-18, 1924-08-28, 1923-05-14, 1921-10-11,…
```
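Since `birthday` is already parsed as a date, derived fields such as age can be computed directly. A minimal sketch with lubridate follows, using the first three birthdays shown above. The reference date `as_of` is an assumption: it is set here to 2020-08-12, the date this document gives for the data sets.

```r
library(lubridate)

# first three birthdays from the output above
birthday <- as.Date(c("1923-11-18", "1925-03-18", "1924-08-28"))

# assumed reference date (the "as of" date given for the data sets)
as_of <- as.Date("2020-08-12")

# whole years elapsed between each birthday and the reference date
age_years <- interval(birthday, as_of) %/% years(1)

age_years
```

On the full data this would typically be written as a `mutate()` step on `bio`, with some care taken for deceased donors, for whom an age as of today may not be meaningful.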
The giving data has 6 variables and 540,000 observations. The data is stored at the gift level. Each row of the data represents a unique gift and attributes associated with that gift.
There are 4 numeric variables:
```
## Rows: 540,000
## Columns: 4
## $ household_id <dbl> 2231010, 9276150, 4132585, 6308003, 1235119, 8185048, 11…
## $ id <dbl> 2004705, 3496504, 1679611, 9229575, 3229105, 9841718, 57…
## $ gift_id <dbl> 1000064, 1000392, 1000612, 1000726, 1000853, 1000937, 10…
## $ gift_amt <dbl> 24474, 590, 530, 222, 691, 431, 250, 984, 12103, 184750,…
```
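Because giving is stored at the gift level and biographical data at the donor level, a common first step is to roll gifts up to one row per donor and join the result on `id`. The sketch below uses toy values that are not taken from the data sets (donor ids 1 to 3 and their gifts are hypothetical). Only the column names `id`, `gift_id`, `gift_amt`, and `state` come from the tables described here.

```r
library(dplyr)
library(tibble)

# hypothetical gift-level rows: donor 1 has two gifts, donor 2 has one
gifts <- tibble(
  id       = c(1, 1, 2),
  gift_id  = c(101, 102, 103),
  gift_amt = c(100, 250, 50)
)

# hypothetical donor-level rows: donor 3 has no gifts
donors <- tibble(
  id    = c(1, 2, 3),
  state = c("CA", "IA", "AR")
)

# one row per donor: number of gifts and total amount given
giving_by_donor <- gifts %>%
  group_by(id) %>%
  summarise(
    n_gifts      = n(),
    total_giving = sum(gift_amt),
    .groups      = "drop"
  )

# keep every donor; donors without gifts get NA in the giving columns
donor_giving <- donors %>%
  left_join(giving_by_donor, by = "id")

donor_giving
```

A `left_join()` is used so that non-donors remain in the result. Whether their missing totals should be replaced with 0 depends on the analysis.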
When loaded by default, the remaining 2 variables are 1 character variable and 1 date variable.